New visualization tools for numeric distributional data tables

From pre-processing to interpretation

Antonio Irpino, Ph.D.

Dept. of Mathematics and Physics
University of Campania L. Vanvitelli
Caserta, Italy

Thursday, the 9th of November, 2023

Layout

1) Aggregate and distributional data

Distributions are the numbers of the future.
Schweizer (1984)

2) Visualizing a table of (1D) distributions

3) Visualizing a single row (through eye iris or flowers)

The greatest value of a picture is when it forces us to notice what we never expected to see.
John Tukey

4) Visualizing large distributional data tables (extending a heatmap)

Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.
John Tukey

5) An application on Chile climatic data

Aggregate and distributional data: numeric distributional data

Let’s see an example: BLOOD dataset from the HistDAWass R package.

It is a classical (in the Symbolic Data Analysis community) dataset describing

  • 14 typologies of patients;
  • 3 distributional variables;

after aggregating a set of raw data from a hospital. See: Billard and Diday (2006)

name        Cholesterol bins  p      Hemoglobin bins  p      Hematocrit bins  p
u1: F-20    [80 ; 100]        0.025  [12 ; 12.9]      0.050  [35 ; 37.5]      0.025
            [100 ; 120]       0.075  [12.9 ; 13.2]    0.112  [37.5 ; 39]      0.075
            [120 ; 135]       0.175  [13.2 ; 13.5]    0.212  [39 ; 40.5]      0.188
            [135 ; 150]       0.250  [13.5 ; 13.8]    0.201  [40.5 ; 42]      0.387
            [150 ; 165]       0.200  [13.8 ; 14.1]    0.188  [42 ; 45.5]      0.287
            [165 ; 180]       0.162  [14.1 ; 14.4]    0.137  [45.5 ; 47]      0.038
            [180 ; 200]       0.088  [14.4 ; 14.7]    0.075
            [200 ; 240]       0.025  [14.7 ; 15]      0.025
u2: F-30    [80 ; 100]        0.013  [10.5 ; 11]      0.007  [31 ; 33]        0.046
            [100 ; 120]       0.088  [11 ; 11.3]      0.039  [33 ; 35]        0.171
            [120 ; 135]       0.154  [11.3 ; 11.6]    0.082  [35 ; 36.5]      0.295
            [135 ; 150]       0.253  [11.6 ; 11.9]    0.174  [36.5 ; 38]      0.243
            [150 ; 165]       0.210  [11.9 ; 12.2]    0.216  [38 ; 39.5]      0.170
            [165 ; 180]       0.177  [12.2 ; 12.5]    0.266  [39.5 ; 41]      0.072
            [180 ; 195]       0.066  [12.5 ; 12.8]    0.157  [41 ; 44]        0.003
            [195 ; 210]       0.026  [12.8 ; 14]      0.059
            [210 ; 240]       0.013
u14: M-80+  [155 ; 170]       0.067  [10.8 ; 11.2]    0.133  [33.5 ; 35.5]    0.133
            [170 ; 185]       0.133  [11.2 ; 11.6]    0.067  [35.5 ; 37.5]    0.267
            [185 ; 200]       0.200  [11.6 ; 12]      0.134  [37.5 ; 39.5]    0.267
            [200 ; 215]       0.267  [12 ; 12.4]      0.333  [39.5 ; 41.5]    0.133
            [215 ; 230]       0.200  [12.4 ; 12.8]    0.200  [41.5 ; 43]      0.200
            [230 ; 245]       0.067  [12.8 ; 13.2]    0.133
            [245 ; 260]       0.066

The first two and the last patient typologies in the BLOOD dataset.

Numerical distributional dataset

A distributional dataset is a classical table with \(N\) observations on the rows and \(P\) variables on the columns, such that the generic cell \(y_{ij}\) is a numerical univariate distribution

\[y_{ij}\sim f_{ij}(x_j)\] where \(x_j\in D_j \subset \Re\) and \(f_{ij}(x_j)\geq 0\),

  • \(\int\limits_{x_j \in D_j}f_{ij}(x_j)\,dx_{j}=1\), if the distribution has a continuous support;
  • \(\sum\limits_{x_j\in D_j}{ f_{ij}(x_j)}=1\), if the distribution has a discrete support.
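The histogram-valued cells above can be sketched in code. The following is a minimal illustration (in Python, not the HistDAWass R package used in the talk; the names `make_cell` and `u1_chol` are hypothetical), representing one cell \(y_{ij}\) by its bins and per-bin masses and checking that the masses sum to 1.

```python
# Illustrative sketch: a histogram-valued cell y_ij as (bins, masses).
# Names here are made up for the example, not taken from HistDAWass.

def make_cell(bins, masses, tol=1e-9):
    """Return a histogram-valued cell after validating it.

    bins   -- list of (lower, upper) interval bounds
    masses -- list of non-negative masses, one per bin, summing to 1
    """
    assert len(bins) == len(masses)
    assert all(m >= 0 for m in masses)
    assert abs(sum(masses) - 1.0) < tol, "masses must sum to 1"
    return {"bins": bins, "masses": masses}

# First cell of the BLOOD dataset (u1: F-20, Cholesterol), from the table above.
u1_chol = make_cell(
    bins=[(80, 100), (100, 120), (120, 135), (135, 150),
          (150, 165), (165, 180), (180, 200), (200, 240)],
    masses=[0.025, 0.075, 0.175, 0.250, 0.200, 0.162, 0.088, 0.025],
)
```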

The BLOOD dataset

The basic plot for the \(i\)-th observation

The \(i\)-th observation is the vector \(y_i=[y_{i1},\ldots,y_{ij},\dots,y_{iP}]\)

Steps :

  1. Domain discretization

    • For continuous variables. For each variable \(Y_j\), we consider the domain \(D_j\) and, fixing an integer \(K_j\), we partition \(D_j\) into \(K_j\) equi-width intervals (bins) of values, such that: \[D_j=\left\{ B_{jk}=(a_k,b_k] \,\middle|\, b_k>a_k,\; k=1,\ldots,K_j\, , \bigcup_{k=1}^{K_j}B_{jk}=[\min(D_j),\max(D_j)],\; B_{jk}\cap B_{jk'}=\emptyset \text{ for } k\neq k' \right\} \]
    • For discrete variables. For each variable \(Y_j\) we consider the domain \(D_j\) and, with \(\# D_j=K_j\) the cardinality of \(D_j\), we take the elements of \(D_j\) as the categories.
  2. Choice of a divergent color palette We consider a divergent color palette with \(K_j\) levels, such that level \(1\) represents the lowest category and level \(K_j\) the highest one.

  3. Stacked percentage barcharts We compute the mass observed in each bin/category for each \(y_{ij}\).
    For the \(i\)-th observation, \(P\) bars are generated. The order of the bars can be decided according to the user's preferences, or it can be suggested by a correlation analysis performed on all the data in advance (e.g., one may cluster the distributional variables using a hierarchical clustering based on the Wasserstein correlation, and then use the order of the leaves returned by the aggregation).

  4. Polar coordinates Polar coordinates allow us to represent the stacked barcharts as circles that mimic an eye iris.
    We call this plot the Eye Iris plot (EI plot).
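Steps 1 and 3 above can be sketched in code. The following Python illustration (again not the HistDAWass API; `recode` is a hypothetical helper) partitions a domain into \(K\) equi-width bins and redistributes a histogram's mass onto them, under the usual assumption that mass is uniform within each original bin.

```python
# Illustrative sketch of domain discretization + mass computation per bin,
# assuming uniform mass within each original bin of the histogram.

def recode(bins, masses, dmin, dmax, K):
    """Return the K masses of the histogram recoded on K equi-width bins
    partitioning the domain [dmin, dmax]."""
    width = (dmax - dmin) / K
    out = [0.0] * K
    for (a, b), m in zip(bins, masses):
        for k in range(K):
            lo, hi = dmin + k * width, dmin + (k + 1) * width
            overlap = max(0.0, min(b, hi) - max(a, lo))
            if overlap > 0:
                # Share of this bin's mass falling into target bin k.
                out[k] += m * overlap / (b - a)
    return out

# Cholesterol of u1 (BLOOD table) recoded on K = 5 bins over the domain [80, 270]:
bins = [(80, 100), (100, 120), (120, 135), (135, 150),
        (150, 165), (165, 180), (180, 200), (200, 240)]
masses = [0.025, 0.075, 0.175, 0.250, 0.200, 0.162, 0.088, 0.025]
recoded = recode(bins, masses, 80, 270, 5)
```

Total mass is preserved by the recoding, since the overlaps of each original bin with the target bins sum to its full width.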

Example using BLOOD data

The extremes of the domains of the variables

  • Range of Cholesterol [ 80 ; 270 ]

  • Range of Hemoglobin [ 10.2 ; 15 ]

  • Range of Hematocrit [ 30 ; 47 ]

Choice of \(K\) and of a color palette

We fix \(K=50\) and we will use a color palette from Red (low values), passing through Yellow (middle values) to Green (high values).
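The palette can be built by hand without any plotting library. A minimal sketch (the function name `palette` is illustrative): \(K=50\) RGB levels linearly interpolating Red (low) through Yellow (middle) to Green (high).

```python
# Hand-rolled diverging palette: K levels, Red -> Yellow -> Green.

def palette(K=50):
    """Return K RGB triples going Red (low) -> Yellow (mid) -> Green (high)."""
    colors = []
    for k in range(K):
        t = k / (K - 1)          # position in [0, 1]
        if t < 0.5:              # Red (1,0,0) -> Yellow (1,1,0)
            colors.append((1.0, 2 * t, 0.0))
        else:                    # Yellow (1,1,0) -> Green (0,1,0)
            colors.append((2 * (1 - t), 1.0, 0.0))
    return colors

pal = palette(50)
# pal[0] is pure red, pal[-1] is pure green.
```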

Now, let’s take the first observation

Recode the distributions according to the \(K=50\) partition of the domains.

Since the bins represent classes of values, we can consider them as ranked levels of the domain.

We propose to see all three distributions using a stacked percentage barchart, as follows. Note that each color level has an area proportional to the mass associated with its bin.

The dashed line is positioned at level \(0.5\), indicating where the median of each distribution falls, given the color level associated with each bin of the respective domain.
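Locating the median in the stacked bar amounts to finding the first bin where the cumulative mass reaches \(0.5\). A minimal Python sketch (the helper `median_bin` is illustrative):

```python
# The median bin is the first bin at which the cumulative mass reaches 0.5,
# i.e. where the dashed line at level 0.5 crosses the stacked bar.

def median_bin(masses):
    """Index of the bin containing the median of a binned distribution."""
    cum = 0.0
    for k, m in enumerate(masses):
        cum += m
        if cum >= 0.5:
            return k
    return len(masses) - 1

# Cholesterol of u1 (BLOOD table): cumulative mass reaches 0.525 at the
# fourth bin [135 ; 150], so the median lies there.
masses = [0.025, 0.075, 0.175, 0.250, 0.200, 0.162, 0.088, 0.025]
print(median_bin(masses))  # -> 3
```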

However, this kind of visualization is not immediate when comparing several observations. Let’s see an example:

For this reason, we propose to use a plot based on polar coordinates, adding a pupil to reduce the distortion due to the polar transformation, as follows:

Since humans readily perceive eye shapes and colors, we believe this kind of visualization is more interpretable. For example, let’s look at all 14 observations together.

Interpretation

According to the fill colors, we can compare both observations and distributional values.

The Enriched plot

We propose to add information about dispersion and skewness.

The dispersion

Each distributional variable in the dataset may have a different dispersion. The dispersion of each cell is measured by its standard deviation \(\sigma_{ij}\). We normalize each \(\sigma_{ij}\) by the maximum standard deviation observed for the \(j\)-th variable, \(\max\limits_{i=1,\ldots,N}(\sigma_{ij})\). A segment, centered in the respective sector, allows the reader to perceive the dispersion associated with each distribution.
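The dispersion ingredient can be sketched as follows (Python illustration, not the package's code; `hist_std` and `normalized_stds` are hypothetical names). The standard deviation of a histogram is computed assuming uniform mass within each bin, then normalized by the column maximum.

```python
# sigma_ij of a histogram (uniform mass within each bin), normalized by the
# maximum sigma over the column, so the result lies in (0, 1].
import math

def hist_std(bins, masses):
    """Standard deviation of a histogram with uniform mass inside each bin."""
    mean = sum(m * (a + b) / 2 for (a, b), m in zip(bins, masses))
    # E[X^2] of a uniform on (a, b] is its center squared plus width^2 / 12.
    ex2 = sum(m * (((a + b) / 2) ** 2 + (b - a) ** 2 / 12)
              for (a, b), m in zip(bins, masses))
    return math.sqrt(ex2 - mean ** 2)

def normalized_stds(column):
    """column: list of (bins, masses) cells for one distributional variable."""
    stds = [hist_std(b, m) for b, m in column]
    smax = max(stds)
    return [s / smax for s in stds]  # 1.0 marks the most dispersed cell
```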

The skewness

Each \(y_{ij}\) has its skewness value computed via the third standardized moment \(\gamma_{ij}\).

We draw the skewness of \(y_{ij}\) outside the dashed circle if it is positive, and inside if it is negative. The distance from the dashed circle represents the absolute value of the skewness index; if the segment is very close to the dashed circle, the distribution is almost symmetric.
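The skewness ingredient can be sketched analogously (Python illustration; `hist_skewness` is a hypothetical name). The third standardized moment of a histogram is obtained from its first three raw moments, again assuming uniform mass within each bin.

```python
# gamma_ij = E[(X - mu)^3] / sigma^3 for a histogram with uniform mass per bin.

def hist_skewness(bins, masses):
    """Third standardized moment of a binned distribution."""
    m1 = sum(m * (a + b) / 2 for (a, b), m in zip(bins, masses))
    m2 = sum(m * (((a + b) / 2) ** 2 + (b - a) ** 2 / 12)
             for (a, b), m in zip(bins, masses))
    # E[X^3] of a uniform on (a, b] is (a^3 + a^2 b + a b^2 + b^3) / 4.
    m3 = sum(m * (a**3 + a**2 * b + a * b**2 + b**3) / 4
             for (a, b), m in zip(bins, masses))
    var = m2 - m1 ** 2
    central3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    return central3 / var ** 1.5

# gamma > 0: segment drawn outside the dashed circle; gamma < 0: inside;
# |gamma| gives its distance from the circle.
```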

An example applied to Hierarchical clustering

in a PCA

References

Billard, L., and E. Diday. 2006. Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.

Thank you !